Hosted by the IIGB Bioinformatics Core
Genomics Building 1207G
You should receive an email with login information if you’ve never had an account on the HPCC. For those with an account already, you can use the same credentials to log into the server.
Open up your browser and type the following URL: https://ondemand.hpcc.ucr.edu
Note
Use the provided login credentials to log in to OnDemand. Then you’ll be ask for authentication password and two-factor validation with DUO.
After logging in, you’ll see the HPCC OnDemand Welcome Page
We will use the interactive Jupyter notebook for the workshop tutorials. To request the interactive app, click on Interactive Apps and select Jupyter Notebook.
You will see an app launch form. Fill in the following information:
Then click Launch and wait for resources to be gathered for your app.
Once the app is ready, click Connect to Jupyter to launch the Jupyter notebook.
We will clone the GitHub repository for this workshop. The repo is stored at: GitHub Repo
Cloning the repository will download all the files and directories from GitHub to your server/computer.
Cloning the repository:
Once the cloning is complete, we can go into the newly-downloaded workshop repository to run our analyses.
# change to the workshop directory
cd RNA-Seq-Workshop-Jupyter-notebook
# lists directory content
ls
The home directory structure for the workshop should look like this:
├── README.md
├── RNA-seq-workflow.ipynb
├── analysis
├── code
├── genome
├── index
├── log
├── metadata
└── rawTip
Cloning the repository will give you the most up-to-date files. To get the latest updates of the repository later on, you don’t need to clone again. Instead, you can pull the updates from GitHub to your server/computer. Make sure you are in the home directory of the workshop before running the command.
Caution
The pull command will overwrite any changes you’ve made locally. If you want to keep the changes and update the new things, you can run:
git stash: This command stashes your local changes, creating a temporary backup.git pull: This command fetches the changes from the remote repository and merges them into the local branch.git stash apply: This command applies the stashed changes back to the working directory.Alternatively, if you want a complete reset of your local directory with the updates from GitHub, you can run:
git reset --hard HEAD: This command resets the current branch to the HEAD commit of the remote repository, discarding any local changes.git clean -f -d: This command removes any untracked files and directories from the working directory.git pull: This command fetches the changes from the remote repository and merges them into the local branch.We will install Bash for Jupyter notebook to enable running shell scripts within the notebook.
First, open a terminal session within the Jupyter Lab.
In the terminal, we’ll first create a conda environment and then install the bash-kernel.
Tip
If the Bash kernel doesn’t appear as an option for your notebook, you can shutdown all kernels and try again
RNA sequencing or RNA-seq is one of many methods used for gene expression studies by obtaining a snapshot of the RNA molecules within a biological system.
Reference: Van den Berge et. al (2019) Annu Rev Biomed Data Sci
Next-generation sequencing (NGS) technologies
Sequencing resolution
| Category | Description |
|---|---|
| BioProject | PRJNA950346 |
| GEO Series | GSE228555 |
| Title | Transcriptome expression of WT and mir163 mutant |
| Organism | Arabidopsis thaliana |
| Ecotype | Col-0 |
| Genotype | WT, miR163 mutant |
| Replicates | 3 |
| Tissue | Seedlings |
| Library kit | NEBNext® Ultra™ RNA Library Prep Kit for Illumina (non-stranded) |
| Sequencing | Illumina paired-end 150bp |
Fastqc and trim_galoreSTAR alignerSTARfeatureCountsDESeq2| Program | Type | Software | Reference | Misc |
|---|---|---|---|---|
| Fastqc | QC | Software | Ref | |
| Trim_galore | QC | Software | Ref | |
| Multiqc | QC | Software | Ref | Manual |
| STAR | Alignment | Software | Ref | Manual |
| FeatureCounts | Quantification | Software | Ref | |
| DESeq2 | Differential Expression | Software | Ref | Guide |
| IGV | Visualization | Software | Ref | Web App |
| R | Other | Software | Ref | |
| Tidyverse | Other | Software | Ref | |
| EnhancedVolcano | Other | Software | Ref |
Generate a metadata.csv file containing information about the dataset. The metadata will be used for processing and differential expression analysis.
Note
An extensive metadata file describes the samples in detail. However, the minimum metadata file should contain the following:
Other metadata info can include:
This file contains the metadata for the full dataset
samplename,fq1,fq2,srr_id,ecotype,genotype,treatment,tissue,biorep
mir163_1,SRR24016000_1.fastq.gz,SRR24016000_2.fastq.gz,SRR24016000,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,1
mir163_2,SRR24016001_1.fastq.gz,SRR24016001_2.fastq.gz,SRR24016001,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,2
mir163_3,SRR24016002_1.fastq.gz,SRR24016002_2.fastq.gz,SRR24016002,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,3
wt_1,SRR24016003_1.fastq.gz,SRR24016003_2.fastq.gz,SRR24016003,Col-0,wt,7-day-old seedlings without treatment,seedlings,1
wt_2,SRR24016004_1.fastq.gz,SRR24016004_2.fastq.gz,SRR24016004,Col-0,wt,7-day-old seedlings without treatment,seedlings,2
wt_3,SRR24016005_1.fastq.gz,SRR24016005_2.fastq.gz,SRR24016005,Col-0,wt,7-day-old seedlings without treatment,seedlings,3This file contains the metadata referencing the 1M sub-sampled dataset
samplename,fq1,fq2,srr_id,ecotype,genotype,treatment,tissue,biorep
mir163_1,SRR24016000_sub1M_1.fastq.gz,SRR24016000_sub1M_2.fastq.gz,SRR24016000_sub1M,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,1
mir163_2,SRR24016001_sub1M_1.fastq.gz,SRR24016001_sub1M_2.fastq.gz,SRR24016001_sub1M,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,2
mir163_3,SRR24016002_sub1M_1.fastq.gz,SRR24016002_sub1M_2.fastq.gz,SRR24016002_sub1M,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,3
wt_1,SRR24016003_sub1M_1.fastq.gz,SRR24016003_sub1M_2.fastq.gz,SRR24016003_sub1M,Col-0,wt,7-day-old seedlings without treatment,seedlings,1
wt_2,SRR24016004_sub1M_1.fastq.gz,SRR24016004_sub1M_2.fastq.gz,SRR24016004_sub1M,Col-0,wt,7-day-old seedlings without treatment,seedlings,2
wt_3,SRR24016005_sub1M_1.fastq.gz,SRR24016005_sub1M_2.fastq.gz,SRR24016005_sub1M,Col-0,wt,7-day-old seedlings without treatment,seedlings,3Fastq files contain the raw sequence and the base quality score. Each sequence is comprised of four lines.
Description for each line
The sequence header (starts with @) and contain identifiers for the read
The sequence
'+'
Base quality scores @VH01192:45:AAC7JMMM5:1:1101:19973:1000 1:N:0:AGTTCAGG+TCTGTTGG
NATGGGACAGACATGCTGGCGGCACTCACTCACTTGGGCGGCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTCAGGATCTCGTATTCCCTTTTTTTTTTGTAAATTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
#-C;;CCCCC-C-CCCCCCCCC-CCCCCCCCCCCCCCCCCCC-CCCCCCCC-CC-CCCCCCCCC-CC-CCCCCCCCC;-C--CC-CCCC--CC-C--;---;C----;-;CC-------C-C-C-C--CCCC-;C;CCC-CCCCCCCCCCC
@VH01192:45:AAC7JMMM5:1:1101:20125:1000 1:N:0:AGTTCAGG+TCTGTTGG
NCCCAGCCCCAGCGACTCCTAATAAAGCATTTCAGCAAATAAAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCAAGTCACAGTTCAGGATGTGGTTTTTGGTTTTTTTTTTTTTAAATTTTGGGGGGGGGGGGGGGGGGGG
+
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC--CCCCCCCCCCCCCCCC;CCCC-CCCCCCCCCC---C-C---C-CC-CCCCC;CC-CC-C--;C-CC-------C-C---C-C----CC;C-CCCC;C;CCCCCCCCCCCC
@VH01192:45:AAC7JMMM5:1:1101:21034:1000 1:N:0:AGTTCAGG+TCTGTTGG
NTTGCAATGCTCAATAAGTCTATTCCACCTCAGTGTCCTTTTTAAAGAGTTTTGGAAAAAAAAAAAAAAAAAAAGATCGTAAGAGCACACGTCTGAACTCCAGTCACAGTTCAGGTTGTGGGTTTTCGTGTTTTTGTTTTTATTTTTGGGG
+
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CCCCC;C;CCCCCCCCCCCCCCCCCCCCCCCCCC-CC;C-CC-C;CCCCCC;C;CC-CCCCC;;;CCC-;;C;-;-C---C-C;;;--C-CCCC---C;C-----C;--C-
@VH01192:45:AAC7JMMM5:1:1101:21488:1000 1:N:0:AGTTCAGG+TCTGTTGG
NCCTCAAAAAAAAAAAAAAAAAAAAAAAAAATTTGGTATGTGAAATTTTTTTAATACATTTAAATTTTATGTTTTTGTTTTTCTTTTTTTTTTTTTTAAATATTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCC-;---C-C;-;------CCCCCC;--CC--C-C-;-C-C--;C;-----C-C--CCCC-C--;C-C---C-;C;-C-C;CCCCCCCCC;CCCCC-CCCCCCCCCCC;C;CCCCCCCC;CCC
Sample header:
@VH01192:45:AAC7JMMM5:1:1101:19973:1000 1:N:0:AGTTCAGG+TCTGTTGG
Header Description
unique instrument id
run id
flowcell id
flowcell lane
tile number within the flowcell lane
x-coordinate of the cluster within the tile
y-coordinate of the cluster within the tile
Member of a read/mate pair, 1 or 2
Y if the read is filtered, N otherwise
0 when none of the control bits are on
Index sequence
We will run QC on the raw data using fastqc and trim_galore.
fastqc will generate stats of the raw reads including:
trim_galore will remove adapters in the sequence and can remove N bases or bases with low quality score. trim_galore will also run fastqc on the trimmed dataset.
$ ls analysis/fastqc
SRR24016000_sub1M_1_fastqc.html
SRR24016000_sub1M_1_fastqc.zip
SRR24016000_sub1M_2_fastqc.html
SRR24016000_sub1M_2_fastqc.zip
SRR24016001_sub1M_1_fastqc.html
SRR24016001_sub1M_1_fastqc.zip
SRR24016001_sub1M_2_fastqc.html
SRR24016001_sub1M_2_fastqc.zip
SRR24016002_sub1M_1_fastqc.html
SRR24016002_sub1M_1_fastqc.zip
SRR24016002_sub1M_2_fastqc.html
SRR24016002_sub1M_2_fastqc.zip
SRR24016003_sub1M_1_fastqc.html
SRR24016003_sub1M_1_fastqc.zip
SRR24016003_sub1M_2_fastqc.html
SRR24016003_sub1M_2_fastqc.zip
SRR24016004_sub1M_1_fastqc.html
SRR24016004_sub1M_1_fastqc.zip
SRR24016004_sub1M_2_fastqc.html
SRR24016004_sub1M_2_fastqc.zip
SRR24016005_sub1M_1_fastqc.html
SRR24016005_sub1M_1_fastqc.zip
SRR24016005_sub1M_2_fastqc.html
SRR24016005_sub1M_2_fastqc.zipThe ‘val_1’ and ‘val_2’ files are the validated files after trim_galore processing. These files will be the input for the STAR alignment.
$ ls analysis/trim_galore
mir163_1_val_1_fastqc.html
mir163_1_val_1_fastqc.zip
mir163_1_val_1.fq.gz
mir163_1_val_2_fastqc.html
mir163_1_val_2_fastqc.zip
mir163_1_val_2.fq.gz
mir163_2_val_1_fastqc.html
mir163_2_val_1_fastqc.zip
mir163_2_val_1.fq.gz
mir163_2_val_2_fastqc.html
mir163_2_val_2_fastqc.zip
mir163_2_val_2.fq.gz
mir163_3_val_1_fastqc.html
mir163_3_val_1_fastqc.zip
mir163_3_val_1.fq.gz
mir163_3_val_2_fastqc.html
mir163_3_val_2_fastqc.zip
mir163_3_val_2.fq.gz
SRR24016000_sub1M_1.fastq.gz_trimming_report.txt
SRR24016000_sub1M_2.fastq.gz_trimming_report.txt
SRR24016001_sub1M_1.fastq.gz_trimming_report.txt
SRR24016001_sub1M_2.fastq.gz_trimming_report.txt
SRR24016002_sub1M_1.fastq.gz_trimming_report.txt
SRR24016002_sub1M_2.fastq.gz_trimming_report.txt
SRR24016003_sub1M_1.fastq.gz_trimming_report.txt
SRR24016003_sub1M_2.fastq.gz_trimming_report.txt
SRR24016004_sub1M_1.fastq.gz_trimming_report.txt
SRR24016004_sub1M_2.fastq.gz_trimming_report.txt
SRR24016005_sub1M_1.fastq.gz_trimming_report.txt
SRR24016005_sub1M_2.fastq.gz_trimming_report.txt
wt_1_val_1_fastqc.html
wt_1_val_1_fastqc.zip
wt_1_val_1.fq.gz
wt_1_val_2_fastqc.html
wt_1_val_2_fastqc.zip
wt_1_val_2.fq.gz
wt_2_val_1_fastqc.html
wt_2_val_1_fastqc.zip
wt_2_val_1.fq.gz
wt_2_val_2_fastqc.html
wt_2_val_2_fastqc.zip
wt_2_val_2.fq.gz
wt_3_val_1_fastqc.html
wt_3_val_1_fastqc.zip
wt_3_val_1.fq.gz
wt_3_val_2_fastqc.html
wt_3_val_2_fastqc.zip
wt_3_val_2.fq.gzTo run the STAR aligner, we first need to create an index of the genome.
To generate an index, you’ll need the following files:
Note
You only need to run the indexing step once. The indexed files can be used for alignments for all subsequent samples
Warning
For assembled genomes containing large number of scaffolds or large genomes, this process can take some time and may require adjusting the indexing parameters.
>Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCAT
GAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATGATAATTTT
ATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCT
TGTGGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGATCTA
CATTTGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTA
TGTTTGGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATT
TAGTTGTAGGGATGAAGTCTTTCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGGGATGGTCCTTTAGChr1 Araport11 5UTR 3631 3759 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 3631 3913 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 start_codon 3760 3762 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 CDS 3760 3913 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 3996 4276 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 CDS 3996 4276 . + 2 transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 4486 4605 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 CDS 4486 4605 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 4706 5095 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 CDS 4706 5095 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 5174 5326 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 CDS 5174 5326 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 CDS 5439 5630 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 5439 5899 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 stop_codon 5628 5630 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 3UTR 5631 5899 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 6788 7069 . - . transcript_id "AT1G01020.2"; gene_id "AT1G01020"
Chr1 Araport11 3UTR 6788 7069 . - . transcript_id "AT1G01020.2"; gene_id "AT1G01020"
Chr1 Araport11 3UTR 7157 7314 . - . transcript_id "AT1G01020.2"; gene_id "AT1G01020"##gff-version 3
Chr1 Araport11 gene 3631 5899 . + . ID=AT1G01010;Name=AT1G01010;Note=NAC domain containing protein 1;symbol=NAC001;Alias=ANAC001,NAC domain containing protein 1;full_name=NAC domain containing protein 1;computational_description=NAC domain containing protein 1 (NAC001)%3B FUNCTIONS IN: sequence-specific DNA binding transcription factor activity%3B INVOLVED IN: multicellular organismal development%2C regulation of transcription%3B LOCATED IN: cellular_component unknown%3B EXPRESSED IN: 7 plant structures%3B EXPRESSED DURING: 4 anthesis%2C C globular stage%2C petal differentiation and expansion stage%3B CONTAINS InterPro DOMAIN/s: No apical meristem (NAM) protein (InterPro:IPR003441)%3B BEST Arabidopsis thaliana protein match is: NAC domain containing protein 69 (TAIR:AT4G01550.1)%3B Has 2503 Blast hits to 2496 proteins in 69 species: Archae - 0%3B Bacteria - 0%3B Metazoa - 0%3B Fungi - 0%3B Plants - 2502%3B Viruses - 0%3B Other Eukaryotes - 1 (source: NCBI BLink).;Dbxref=PMID:11118137,PMID:12820902,PMID:15029955,PMID:15010618,PMID:15108305,PMID:15173566,PMID:15282545,PMID:16504176,PMID:16547105,PMID:16524982,PMID:16720694,PMID:16552445,PMID:17339215,PMID:17448460,PMID:17447913,PMID:18650403,PMID:20736450,PMID:24377444,locus:2200935;locus_type=protein_coding
Chr1 Araport11 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Note=NAC domain containing protein 1;conf_class=2;symbol=NAC001;Alias=ANAC001,NAC domain containing protein 1;full_name=NAC domain containing protein 1;computational_description=NAC domain containing protein 1 (NAC001)%3B FUNCTIONS IN: sequence-specific DNA binding transcription factor activity%3B INVOLVED IN: multicellular organismal development%2C regulation of transcription%3B LOCATED IN: cellular_component unknown%3B EXPRESSED IN: 7 plant structures%3B EXPRESSED DURING: 4 anthesis%2C C globular stage%2C petal differentiation and expansion stage%3B CONTAINS InterPro DOMAIN/s: No apical meristem (NAM) protein (InterPro:IPR003441)%3B BEST Arabidopsis thaliana protein match is: NAC domain containing protein 69 (TAIR:AT4G01550.1)%3B Has 2503 Blast hits to 2496 proteins in 69 species: Archae - 0%3B Bacteria - 0%3B Metazoa - 0%3B Fungi - 0%3B Plants - 2502%3B Viruses - 0%3B Other Eukaryotes - 1 (source: NCBI BLink).;conf_rating=****;Dbxref=PMID:11118137,gene:2200934,UniProt:Q0WV96
Chr1 Araport11 five_prime_UTR 3631 3759 . + . ID=AT1G01010:five_prime_UTR:1;Parent=AT1G01010.1;Name=NAC001:five_prime_UTR:1
Chr1 Araport11 exon 3631 3913 . + . ID=AT1G01010:exon:1;Parent=AT1G01010.1;Name=NAC001:exon:1
Chr1 Araport11 CDS 3760 3913 . + 0 ID=AT1G01010:CDS:1;Parent=AT1G01010.1;Name=NAC001:CDS:1
Chr1 Araport11 exon 3996 4276 . + . ID=AT1G01010:exon:2;Parent=AT1G01010.1;Name=NAC001:exon:2
Chr1 Araport11 CDS 3996 4276 . + 2 ID=AT1G01010:CDS:2;Parent=AT1G01010.1;Name=NAC001:CDS:2
Chr1 Araport11 exon 4486 4605 . + . ID=AT1G01010:exon:3;Parent=AT1G01010.1;Name=NAC001:exon:3
Chr1 Araport11 CDS 4486 4605 . + 0 ID=AT1G01010:CDS:3;Parent=AT1G01010.1;Name=NAC001:CDS:3
Chr1 Araport11 exon 4706 5095 . + . ID=AT1G01010:exon:4;Parent=AT1G01010.1;Name=NAC001:exon:4
Chr1 Araport11 CDS 4706 5095 . + 0 ID=AT1G01010:CDS:4;Parent=AT1G01010.1;Name=NAC001:CDS:4
Chr1 Araport11 exon 5174 5326 . + . ID=AT1G01010:exon:5;Parent=AT1G01010.1;Name=NAC001:exon:5
Chr1 Araport11 CDS 5174 5326 . + 0 ID=AT1G01010:CDS:5;Parent=AT1G01010.1;Name=NAC001:CDS:5
Chr1 Araport11 CDS 5439 5630 . + 0 ID=AT1G01010:CDS:6;Parent=AT1G01010.1;Name=NAC001:CDS:6
Chr1 Araport11 exon 5439 5899 . + . ID=AT1G01010:exon:6;Parent=AT1G01010.1;Name=NAC001:exon:6
Chr1 Araport11 three_prime_UTR 5631 5899 . + . ID=AT1G01010:three_prime_UTR:1;Parent=AT1G01010.1;Name=NAC001:three_prime_UTR:1Now we can align the reads from each sample to the indexed reference genome.
Depending on the sequencing depth, this step can take some time.
$ ls analysis/star
wt_1_Aligned.sortedByCoord.out.bam
wt_1_Aligned.sortedByCoord.out.bam.bai
wt_1_Log.final.out
wt_1_Log.out
wt_1_Log.progress.out
wt_1_ReadsPerGene.out.tab
wt_1_Signal.UniqueMultiple.str1.out.wig
wt_1_Signal.UniqueMultiple.str2.out.wig
wt_1_Signal.Unique.str1.out.wig
wt_1_Signal.Unique.str2.out.wig
wt_1_SJ.out.tab$ less analysis/star/wt_1_Log.final.out
Started job on | Oct 27 15:12:05
Started mapping on | Oct 27 15:12:05
Finished on | Oct 27 15:13:29
Mapping speed, Million of reads per hour | 42.77
Number of input reads | 997854
Average input read length | 297
UNIQUE READS:
Uniquely mapped reads number | 939014
Uniquely mapped reads % | 94.10%
Average mapped length | 291.73
Number of splices: Total | 904715
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 894390
Number of splices: GC/AG | 8458
Number of splices: AT/AC | 419
Number of splices: Non-canonical | 1448
Mismatch rate per base, % | 0.00%
Deletion rate per base | 0.00%
Deletion average length | 1.50
Insertion rate per base | 0.00%
Insertion average length | 1.21
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 0
% of reads mapped to multiple loci | 0.00%
Number of reads mapped to too many loci | 18952
% of reads mapped to too many loci | 1.90%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 39869
% of reads unmapped: too short | 4.00%
Number of reads unmapped: other | 19
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%The alignment information are stored in a sequence alignment map (SAM) file. This file can be large depending on the number of sequenced reads. Therefore, SAM files are generally converted to a BAM (binary compressed version of SAM) file to reduce storage space but requires software (e.g., samtools) in order to see the file content.
Sequence alignment map (SAM) format official documentation is available here. The SAM file consists of header rows and rows for each read. Each row contains 11 mandatory fields.
@HD VN:1.4 SO:coordinate
@SQ SN:Chr1 LN:30427671
@SQ SN:Chr2 LN:19698289
@SQ SN:Chr3 LN:23459830
@SQ SN:Chr4 LN:18585056
@SQ SN:Chr5 LN:26975502
@SQ SN:ChrC LN:154478
@SQ SN:ChrM LN:366924
@PG ID:STAR PN:STAR VN:2.7.9a CL:STAR --runMode alignReads --runThreadN 16 --genomeDir index --readFilesIn analysis/trim_galore/wt_1_val_1.fq.gz analysis/trim_galore/wt_1_val_2.fq.gz --readFilesCommand zcat --outFileNamePrefix analysis/star/wt_1_ --outSAMtype BAM SortedByCoordinate --outWigType wiggle --outFilterMultimapNmax 1 --outFilterMismatchNmax 0 --quantMode GeneCounts
@PG ID:samtools PN:samtools PP:STAR VN:1.18 CL:samtools view -h ../demo_files/analysis/star/wt_1_Aligned.sortedByCoord.out.bam
@CO user command line: STAR --runThreadN 16 --runMode alignReads --genomeDir index --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outFileNamePrefix analysis/star/wt_1_ --outFilterMismatchNmax 0 --outFilterMultimapNmax 1 --quantMode GeneCounts --outWigType wiggle --readFilesIn analysis/trim_galore/wt_1_val_1.fq.gz analysis/trim_galore/wt_1_val_2.fq.gz
A00738:657:H72KHDSX7:1:2513:31566:1110 163 chr01 2960 255 150M = 3065 254 CACCCACAGGCACCACCGTCCTTGTTGGTAATGAAGAAGACGAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:297 nM:i:0
A00738:657:H72KHDSX7:1:1330:26928:13870 99 chr01 2983 255 149M = 3071 223 GTTGGTAATGAAGAAGACGAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF NH:i:1 HI:i:1 AS:i:282 nM:i:0
A00738:657:H72KHDSX7:1:1159:8594:10551 163 chr01 3001 255 150M = 3069 218 GAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:298 nM:i:0
A00738:657:H72KHDSX7:1:1640:6551:15311 163 chr01 3006 255 149M = 3088 232 GACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGAC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:297 nM:i:0
A00738:657:H72KHDSX7:1:2574:25084:30937 99 chr01 3007 255 149M = 3207 448 ACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFF:FF NH:i:1 HI:i:1 AS:i:297 nM:i:0
A00738:657:H72KHDSX7:1:1327:12717:27023 99 chr01 3008 255 149M = 3086 227 CGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:296 nM:i:0SAM Header
A00738:657:H72KHDSX7:1:1327:12717:27023
99
chr01
3008
255
149M
=
3086
227
CGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAA
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF
NH:i:1 HI:i:1 AS:i:296 nM:i:0
For decoding the SAM Flags, check out this website from the Broad Institute.
The CIGAR string is a representation of how the read aligned to the reference, including matches, mismatches, deletions, insertions, and splicing. More info on the CIGAR string is available here
The BAM file produced from STAR contains alignment information that can be viewed in IGV. However, this file can be big and sluggish. One option is to convert the BAM file to a smaller BigWig format that will allow fast visualization of the results.
Note
The BigWig format will aggregate and bin your data across the genome. Therefore, you will lose individual read information. For observing SNPs, you’ll want to use the BAM files instead.
After aligning the reads to the genome, we must quantify the total reads associated with each genome feature (e.g., gene).
# Program:featureCounts v2.0.3; Command:"featureCounts" "-T" "8" "-s" "0" "-a" "genome/Araport11_GFF3_genes_transposons.201606.gtf" "-t" "exon" "-g" "gene_id" "--primary" "-o" "analysis/featurecounts/mir163_1.fcnts.txt" "-p" "--countReadPairs" "analysis/star/mir163_1_Aligned.sortedByCoord.out.bam"
Geneid Chr Start End Strand Length analysis/star/mir163_1_Aligned.sortedByCoord.out.bam
AT1G01010 Chr1;Chr1;Chr1;Chr1;Chr1;Chr1 3631;3996;4486;4706;5174;5439 3913;4276;4605;5095;5326;5899 +;+;+;+;+;+ 1688 6
AT1G01020 Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1 6788;6788;6788;6788;6788;6788;7157;7157;7157;7157;7157;7157;7384;7384;7384;7384;7564;7564;7564;7564;7564;7564;7762;7762;7762;7762;7762;7942;7942;7942;7942;7942;8236;8236;8236;8236;8236;8236;8417;8417;8417;8417;8571;8571;8571;8594;8594;8594 7069;7069;7069;7069;7069;7069;7232;7232;7232;7232;7450;7450;7450;7450;7450;7450;7649;7649;7649;7649;7649;7649;7835;7835;7835;7835;7835;7987;7987;7987;7987;7987;8464;8325;8464;8325;8325;8325;8464;8464;8464;8464;9130;9130;8737;9130;9130;8737 -;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;- 1571 18
AT1G03987 Chr1 11101 11372 + 272 1
AT1G01030 Chr1;Chr1;Chr1;Chr1;Chr1 11649;11649;12424;13335;13335 12354;13173;13173;13714;13714 -;-;-;-;- 1905 12
AT1G01040 Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1 23121;23416;24542;24542;24752;24752;25041;25041;25524;25524;25825;25825;26081;26081;26292;26292;26543;26543;26862;26862;27099;27099;27372;27372;27618;27618;27803;27803;28708;28708;28890;28890;29160;29160;30147;30147;30410;30410;30902;30902 24451;24451;24655;24655;24962;24962;25435;25435;25743;25743;25997;25997;26203;26203;26452;26452;26776;26776;27012;27012;27281;27281;27536;27533;27713;27713;28431;28431;28805;28805;29080;29080;30065;30065;30311;30311;30816;30816;31120;31227 +;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+ 6279 67
AT1G03993 Chr1 23312 24099 - 788 0DESeq2 is an R package that allows differential expression analysis.
To run DESeq2, you will need two key inputs:
You will also need to define the type of contrasts/design for the analysis.
For example, suppose we have the following metadata for a dataset:
| sample | replicate | genotype | treatment | grp |
|---|---|---|---|---|
| wt_1 | 1 | wild-type | ctrl | wt_ctrl |
| wt_2 | 2 | wild-type | ctrl | wt_ctrl |
| wt_3 | 1 | wild-type | inhibitor | wt_inhib |
| wt_4 | 2 | wild-type | inhibitor | wt_inhib |
| mut_1 | 1 | mutant | ctrl | mut_ctrl |
| mut_2 | 2 | mutant | ctrl | mut_ctrl |
| mut_3 | 1 | mutant | inhibitor | mut_inhib |
| mut_4 | 2 | mutant | inhibitor | mut_inhib |
We can run the following contrasts:
Contrasts
Comparisons
The custom DESeq2 script generates several plots, tables, and genelists organized into the following directories.
Outputs are organized into the genelists or plots folders.
The genelists folder contains a summary of DEGs, all the deseq2 outputs (stored as an R object (RDS) and a CSV file, and DEGs with annotations.
$ ls -1 analysis/deseq2/genotype/genelists/
deg_summary.csv
deseq2_results.RDS
miR163_mut-wt.deseq2_output.csv
miR163_mut-wt.DownDEG.annotated.csv
miR163_mut-wt.UpDEG.annotated.csvThe plots folder contains a bar plot summarizing the number of DEGs, a PCA plot, and an enhanced volcano plot highlighting the DEGs.
$ ls -1 analysis/deseq2/genotype/plots/*png
analysis/deseq2/genotype/plots/DEG_summary.png
analysis/deseq2/genotype/plots/Enhancedvolcano_miR163_mut-vs-wt.png
analysis/deseq2/genotype/plots/PCA.png| Comparisons | Counts_Up_or_Down | Counts_Up | Counts_Down |
|---|---|---|---|
| miR163_mut-wt | 3 | 2 | 1 |
MultiQC is used to generate a summary of the raw and processed data. The program aggregates all the log files from different programs (e.g., trim_galore, STAR) and provide a summary HTML file.
To inspect and examine your RNA-seq results, you can load the BigWig or BAM files into IGV.
Download the .bam, .bam.bai, and .bw files. The .bam files are quite large (> 1Gb) whereas the .bw files are smaller (Mb). These files can be loaded into IGV for visualization and inspection.
Show IGV demonstration
We will have a hands-on practice with the analysis workflow using a Jupyter notebook.
Please take a few minutes to provide feedback for this workshop by filling out the workshop survey.
All responses are anonymous and will help to improve future workshops and training.
There will be times where you would want to see plots or access files generated on the cluster. For easier access, you can use an SFTP program with a GUI that connects to the server and provides direct access to those files.
There are several free GUI applications including Filezilla, Cyberduck, WinSCP.
These applications are available for MacOS, WinOS, and Linux platforms.
Once you have the application installed, start up the program and follow the instructions to connect to a site.
Click here for instructions to connect and authenticate to the cluster using Filezilla with password + DUO authentication.
For this workshop, we will submit jobs to the cluster (to run in the background).
Example job submission:
The submitted job will have a six-digit JOBID. To check on the status of your submitted job(s),
# Show all jobs from USER
squeue -u <USERNAME>
# Show specific job with JOBID
scontrol show job <JOBID>To cancel a submitted job, use the JOBID.
# To delete a single job
scancel -i <JOBID>
scancel <JOBID>
# To delete all jobs for a user
scancel -u <USERNAME>Tip
All submitted jobs will generate a log file (stored in the log folder) containing the JOBNAME_JOBID.log
You can take a look at the log file to see standard output/errors from running the script.
We will create a conda environment containing bioinformatics software packages not readily available on the cluster for data processing and analysis. This conda environment includes special R packages (e.g., EnhancedVolcano) and software (e.g., multiqc).
To create the conda environment:
Note: This step can take anywhere from 25-45 minutes. However, it’s running in the background so we can come back to it later.
To make sure the conda environment was created successfully, we can activate the conda environment:
You should now see the environment (DEG-analysis) next to your $ prompt.